This is an exploratory data analysis about the relationships between diabetes, obesity, leisure time inactivity, and the per capita income in the US counties. If you just want to see the results, see index.html.
# We use the following libraries
library(dplyr)
library(tidyr)
library(plotly)
library(RColorBrewer)
library(xlsx)
The US county income data is from Wikipedia page List of United States counties by per capita income. This diabetes dataset is from CDC Diabetes page County Data Indicators. The dataset contains statistics about diabetes, obesity, and leisure time inactivity in US counties between 2004-2013. Here, the data from year 2013 was used. Both datasets were downloaded on 2/3/2017.
The county names in both dataset were cleaned and made compatible so that the dataset could be joined.
# Cleaning the names
# =====================
# This function makes the county names coming from two sources,
# Wikipedia and CSC, compatible with each other so that we can
# join the datasets. This is ugly and could be simplified, but
# it makes the county names compatible.
clean_county_name <- function (county, state) {
county <- gsub('LaGrange', 'lagrange', county)
county <- gsub('Doña Ana', 'Dona Ana', county)
county <- ifelse(grepl(' Borough$', county), substr(county, 1, nchar(county)-8),
ifelse(grepl(' Census Area$', county), substr(county, 1, nchar(county)-12),
ifelse(grepl(' County$', county), substr(county, 1, nchar(county)-7),
ifelse(grepl(' Count$', county), substr(county, 1, nchar(county)-6),
ifelse(grepl(' Parish$', county), substr(county, 1, nchar(county)-7),
ifelse(grepl(' Pa$', county), substr(county, 1, nchar(county)-3),
ifelse(grepl(' [Cc]ity$', county), paste0(substr(county, 1, nchar(county)-4), 'city'),
county)))))))
county <- ifelse(state == 'District of Columbia', 'Washington city', county)
county <- sub('^(La Rue|Larue)', 'LaRue', county)
county <- sub('( K)? 2004 to 2008 Census', '', county)
county <- sub(' 200[89] Census', '', county)
county <- sub('De Witt', 'DeWitt', county)
county <- sub('De Soto', 'DeSoto', county)
county <- sub('Prince of Wales-Outer', 'Prince of Wales', county)
county <- sub(' City and', '', county)
county <- sub('Vermilion', 'Vermillion', county)
county <- ifelse(state == 'South Dakota' & county == 'Shannon', 'Oglala Lakota', county)
county <- ifelse(state == 'Alaska' & county == 'Wade Hampton', 'Kusilvak', county)
county <- sub(' Municipality?', '', county)
county <- sub(' Census', '', county)
county
}
# Reading and cleaning the diabetes datasets
# ==========================================
# This function reads an xlsx file and produces a csv file with nicer column
# names. The counties in Puerto Rico are dropped. The data is in wide format
# containing values from different years in each row. A long format is produced
# that has each value from each year in a different row. The long format data is
# written into csv files.
readAndCleanHealthData <- function (xlsxFile) {
csvFile <- sub('xlsx$', 'csv', xlsxFile)
rawdata <- read.xlsx2(xlsxFile,1)
write.csv(rawdata, csvFile)
rawdata <- read.csv(csvFile, stringsAsFactors = FALSE) %>% select(-1)
data_wide <- rawdata
headers1 <- as.character(colnames(data_wide))
headers2 <- as.character(data_wide[1,])
repeating <- headers2[4:10]
headers <- c(headers2[1:3], paste(rep(2004:2013, each=7), repeating, sep="."))
data_wide <- data_wide[2:nrow(data_wide),]
colnames(data_wide) <- headers
data_wide[4:73] <- lapply(data_wide[4:73], as.numeric)
#data_wide <- data_wide %>% filter(!(State %in% c("Alaska", "Hawaii", "Puerto Rico")))
data_wide <- data_wide %>% filter(!(State %in% c("Puerto Rico")))
wideFile <- sub('\\.xlsx$', '-cleaned-wide.csv', xlsxFile)
write.csv(data_wide, wideFile, row.names = FALSE)
data_wide[4:73] <- lapply(data_wide[4:73], as.numeric)
data_long <- gather(data_wide, key, value, 4:73) %>%
separate(key, c('Year', 'key'), sep="[.]", remove=TRUE) %>%
mutate(Year = as.numeric(Year),
county = clean_county_name(County, State))
longFile <- sub('\\.xlsx$', '-cleaned-long.csv', xlsxFile)
write.csv(data_long, longFile, row.names = FALSE)
}
readAndCleanHealthData('USDiabetes/DM_PREV_ALL_STATES.xlsx')
readAndCleanHealthData('USDiabetes/OB_PREV_ALL_STATES.xlsx')
readAndCleanHealthData('USDiabetes/LTPIA_PREV_ALL_STATES.xlsx')
# Reading and cleaning the income data
# ====================================
income2013 <- read.csv('Income/income-county.table', stringsAsFactors = FALSE, sep=";") %>%
filter(!(State %in% c('Country', 'State'))) %>%
select(-1) %>%
mutate(county = clean_county_name(County.equivalent, State))
income2013[3:7] <- lapply(income2013[3:7], function (x) as.numeric(gsub(',', '', sub('$', '', x, fixed = TRUE))))
write.csv(income2013, 'Income/income-county.csv', row.names = FALSE)
# Read income data and cut into population size categories
# 0-10000, 10000-100000, 100000-100000, and over 1000000 people.
# Make a new variable Income, which is per capita income in $1000.
income2013 <- read.csv('Income/income-county.csv', stringsAsFactors = FALSE)
SizeCategory <- cut(income2013$Population, breaks=c(0, 10^4, 10^5, 10^6, Inf),
labels=c("0-10K people", "10K-100K people", "100K-1M people", "over 1M people"))
income2013 <- income2013 %>%
mutate(SizeCategory = SizeCategory,
Income = Per.capita.income/1000)
# Read diabetes data and select year 2013 and the diabetes prevalence, which is in variable 'percent'
diabetes2013 <- read.csv('USDiabetes/DM_PREV_ALL_STATES-cleaned-long.csv', stringsAsFactors = FALSE) %>%
filter(Year == 2013 & key=="percent") %>%
rename(Prevalence = value)
To examine the possible relationship between diabetes prevalence and per capita income in different US counties, we join the datasets using the state and county.
# We do an inner join with the fields, 'State', and 'county'.
dminc2013 <- inner_join(diabetes2013, income2013) %>%
mutate(Label = paste0(County, ", ", State, "<br>Population: ", Population))
# Just take the complete cases.
dminc2013 <- dminc2013[complete.cases(dminc2013),]
We plot the data as a scatterplot having one point for each county. The colors of the points are based on the population size. A linear regression line is shown too.
# We use this function for plotting the data.
doPlot <- function(data, x, y, xlabel, ylabel, title, pal) {
plot_ly(data,
type = 'scatter',
mode = 'markers',
x = ~x,
y = ~y,
text = ~Label,
color = ~SizeCategory,
colors = pal) %>%
add_trace(x = ~x,
y = ~fitted(lm(y ~ x)),
type = 'scatter',
mode = 'lines-markers',
name = 'regression line',
color= I("black"),
line = list(color = "black")) %>%
layout(#title = title,
xaxis = list(title = xlabel, rangemode="tozero", showgrid = F),
yaxis = list(title = ylabel, rangemode="tozero"),
showlegend = TRUE)
}
# Apply it to diabetes vs. income data.
doPlot(dminc2013, dminc2013$Income, dminc2013$Prevalence, "Per capita income in $1000", "Diabetes prevalence (%)", "Diabetes prevalence vs. per capita income in US counties",
brewer.pal(6, 'YlOrRd')[3:6])
fit <- lm(Prevalence ~ Income, data=dminc2013)
#summary(fit)
There seems to be a pattern that diabetes prevalence is higher in counties with smaller per capita income.
Similarly we join the obesity data with the income data, and plot it.
obesity2013 <- read.csv('USDiabetes/OB_PREV_ALL_STATES-cleaned-long.csv', stringsAsFactors = FALSE) %>%
filter(Year == 2013 & key=="percent") %>%
rename(Prevalence = value)
obinc2013 <- inner_join(obesity2013, income2013) %>%
mutate(Label = paste0(County, ", ", State, "<br>Population: ", Population))
obinc2013 <- obinc2013[complete.cases(obinc2013),]
doPlot(obinc2013, obinc2013$Income, obinc2013$Prevalence, "Per capita income in $1000", "Obesity prevalence (%)", "Obesity prevalence vs. per capita income in US counties",
brewer.pal(6, 'YlOrRd')[3:6])
fit <- lm(Prevalence ~ Income, data=obinc2013)
#summary(fit)
Also here, we seem to have be a pattern that obesity prevalence is higher in counties with smaller per capita income.
Next we join the leisure time inactivity data with the income data, and plot it.
inactivity2013 <- read.csv('USDiabetes/LTPIA_PREV_ALL_STATES-cleaned-long.csv', stringsAsFactors = FALSE) %>%
filter(Year == 2013 & key=="percent") %>%
rename(Prevalence = value)
iainc2013 <- inner_join(inactivity2013, income2013) %>%
mutate(Label = paste0(County, ", ", State, "<br>Population: ", Population))
iainc2013 <- iainc2013[complete.cases(iainc2013),]
doPlot(iainc2013, iainc2013$Income, iainc2013$Prevalence,
"Per capita income in $1000", "Leisure time inactivity prevalence (%)",
"Leisure time inactivity prevalence vs. per capita income in US counties",
brewer.pal(6, 'YlOrRd')[3:6])
fit <- lm(Prevalence ~ Income, data=iainc2013)
#summary(fit)
The pattern is similar as in the cases above.
It is interesting to see, if there is a relationship between obesity and diabetes is. To do this, we join the obesity and diabetes data, and plot the result.
obdm2013 <- inner_join(obesity2013, diabetes2013, by=setdiff(intersect(colnames(diabetes2013), colnames(obesity2013)), "Prevalence")) %>%
mutate(Label = paste(County, ", ", State))
obdm2013 <- inner_join(obdm2013, income2013) %>%
mutate(Label = paste0(County, ", ", State, "<br>Population: ", Population))
obdm2013 <- obdm2013[complete.cases(obdm2013),]
doPlot(obdm2013, obdm2013$Prevalence.x, obdm2013$Prevalence.y, "Obesity prevalence %", "Diabetes prevalence %", "Diabetes vs. obesity prevalence in US counties.",
brewer.pal(7, 'YlGnBu')[3:6])
fit <- lm(Prevalence.y ~ Prevalence.x, data=obdm2013)
#summary(fit)
Higher prevalence of obesity seems to be related to higher prevalene of diabetes.
Similarly we examine the relationship between leisure time inactivity and diabetes.
iadm2013 <- inner_join(inactivity2013, diabetes2013, by=setdiff(intersect(colnames(inactivity2013), colnames(diabetes2013)), "Prevalence")) %>%
mutate(Label = paste(County, ", ", State))
iadm2013 <- inner_join(iadm2013, income2013) %>%
mutate(Label = paste0(County, ", ", State, "<br>Population: ", Population))
iadm2013 <- iadm2013[complete.cases(iadm2013),]
doPlot(iadm2013, iadm2013$Prevalence.x, iadm2013$Prevalence.y,
"Leisure time inactivity prevalence (%)", "Diabetes prevalence %",
"Diabetes vs. leisure time inactivity prevalence in US counties.",
brewer.pal(7, 'YlGnBu')[3:6])
fit <- lm(Prevalence.y ~ Prevalence.x, data=iadm2013)
#summary(fit)
Similar trend can be seen as above with obesity vs. diabetes.
Finally, we examine the relationship between leisure time inactivity and obesity.
iaob2013 <- inner_join(inactivity2013, obesity2013, by=setdiff(intersect(colnames(inactivity2013), colnames(obesity2013)), "Prevalence")) %>%
mutate(Label = paste(County, ", ", State))
iaob2013 <- inner_join(iaob2013, income2013) %>%
mutate(Label = paste0(County, ", ", State, "<br>Population: ", Population))
iaob2013 <- iaob2013[complete.cases(iaob2013),]
doPlot(iaob2013, iaob2013$Prevalence.x, iaob2013$Prevalence.y,
"Leisure time inactivity prevalence (%)", "Obesity prevalence %",
"Obesity vs. leisure time inactivity prevalence in US counties.",
brewer.pal(7, 'YlGnBu')[3:6])
fit <- lm(Prevalence.y ~ Prevalence.x, data=iaob2013)
#summary(fit)
Also here the trend is similar. Higher prevalence of leisure time inactivity seems to be related to higher prevalence of obesity.
Lots of manual work was caused by the different forms of county names. There were systematic differences like omission of the word ‘County’ in the county names in the Wikipedia data and non-systematic differences like ‘De Witt’ vs. ‘DeWitt’.
The plotly package was not completely easy to use. One particularly strange feature came out, when I added the linear regression lines to the pictures. For some reason, plotly wanted to add the marker points to the regression line too (projecting the x-values of the original points to the regression line). These markers completely hid the actual line. It did not matter, whether I defined mode="lines" or mode="lines+markers". Finally I tried mode="lines-markers" and it removed the markers! However, I did not find this in the documentation.